After the variant call process the VCF file generated contains some artefacts that affect downstream analysis. Alignment artifacts can occur whenever there is sufficient sequence similarity between two or more regions in the genome to confuse the alignment algorithm. This can occur when the aligner for whatever reason overestimate how uniquely a read maps, thereby assigning it too high of a mapping quality. It can also occur through no fault of the aligner due to gaps in the reference, which can also hide the true position to which a read should map. Even thought GATK solve most of these artifacts some still persists. In this tutorial we will filter out artifacts from our variant calls and generate a ‘cleaner’ set of loci for downstream analysis.
For this step, the .vcf file is still too large to be
manipulated in our personal computer, for that reason this step had been
P. vivax has several genomic regions like the vir genes that
are hard to align because they
For this first step we require the following files:
.vcf.gz or .vcf files: If a
.vcf.gz is used, a .vcf.tbi will be also
required. For this tutorial the .vfc.gz file that we are
going to work with is:
ColombiaGates_Pviv.JointCall.filtered.combined.snpeff.vcf.gz.
.gff: the gff file of the reference genome will be
also required in order to filter coding regions.
noncore_regions.bed: This file will be used to
remove non-core regions that are difficult to sequence from the
analysis.
load_libraries.R: This R script file load all the
required packages that we need to perform the analysis.
functions.R: This R script file load all the
functions that we have created to perform the analysis. Within those
functions we a function named fx_run_vcftools that allow us
to manipulate the vcftools package from the R environment.
We have also created other functions to get information from the
.vcf file with loading the whole file in R environment.
This functions will help us the get information regarding to the name of
the samples and create a samples.indv to use as a filter in
the analysis. Other functions will be explained later in this
tutorial.
Filtering_vcf_file.R: This R script file contains
the instructions to perform the analysis in R. This script requires 9
arguments:
wd: Working directory where all temporal files and
final outputs will be stored.
fd: Directory where the pre-required files are
stored.
gzvcf: Path to the .vcf.gz
file.
o: prefix used for the temporal and final
outputs.
rkeep: regular expression to identify samples of
interest.
ebed: name of the noncore_regions.bed
file. This file should be stored in the fd path.
gff: name of the .gff file. This file
should be stored in the fd path.
n: number of iterations. In the final step of this
script we are going to transform our data from VCF format to rGenome
format. This step is computing exhausting and consumes RAM memory. For
that reason we subdivide this step in n instances to be
able to run in the server.
thres: Minimum read depth to call an allele in a
sample.
r_options.sh: bash script with the instructions to
run R. This file should be stored in wd. The scrip should
looks as follow:
##!/bin/bash
# source /broad/software/scripts/useuse
# use R-4.1
# Rscript /gsap/garage-protistvector/ColombiaData/Pviv/AnalysisTools/Filtering_vcf_file.R \
# -wd /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/ \
# -fd /gsap/garage-protistvector/ColombiaData/Pviv/AnalysisTools \
# -gzvcf /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/ColombiaGates_Pviv.JointCall.filtered.combined.snpeff.vcf.gz \
# -o ColombiaGates \
# -rkeep (Col|SP) \
# -ebed Pvivax_P01_noncore.bed \
# -gff genes.gff \
# -n 500 \
# -thres 5
uger_options.sh: bash script with the instruction to
run UGER. This file should be stored in wd. The scrip
should looks as follow:##!/bin/bash
#qsub -l h_vmem=32G \
# -l h_rt=01:00:00 \
# -o /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/output/ \
# /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/r_options.sh
As a final output this step will generate and .RData
file containing the data in rGenome format.
setwd('~/Documents/Github/Plasmodium_WGS_analysis/Pre_filtering/')
source('../functions_libraries/load_libraries.R')
for(file in list.files('PerVen/Pviv/', pattern = '.RData')){
load(file.path('PerVen/Pviv/', file))
rm(list = ls()[!grepl('PvP01_(\\d+|MIT|API)_v1_rGenome_object', ls())])
}
rGenome_objects = list(PvP01_01_v1_rGenome_object,
PvP01_02_v1_rGenome_object,
PvP01_03_v1_rGenome_object,
PvP01_04_v1_rGenome_object,
PvP01_05_v1_rGenome_object,
PvP01_06_v1_rGenome_object,
PvP01_07_v1_rGenome_object,
PvP01_08_v1_rGenome_object,
PvP01_09_v1_rGenome_object,
PvP01_10_v1_rGenome_object,
PvP01_11_v1_rGenome_object,
PvP01_12_v1_rGenome_object,
PvP01_13_v1_rGenome_object,
PvP01_14_v1_rGenome_object,
PvP01_MIT_v1_rGenome_object,
PvP01_API_v1_rGenome_object)
rm(list = c('PvP01_01_v1_rGenome_object',
'PvP01_02_v1_rGenome_object',
'PvP01_03_v1_rGenome_object',
'PvP01_04_v1_rGenome_object',
'PvP01_05_v1_rGenome_object',
'PvP01_06_v1_rGenome_object',
'PvP01_07_v1_rGenome_object',
'PvP01_08_v1_rGenome_object',
'PvP01_09_v1_rGenome_object',
'PvP01_10_v1_rGenome_object',
'PvP01_11_v1_rGenome_object',
'PvP01_12_v1_rGenome_object',
'PvP01_13_v1_rGenome_object',
'PvP01_14_v1_rGenome_object',
'PvP01_MIT_v1_rGenome_object',
'PvP01_API_v1_rGenome_object'))
source('../functions_libraries/functions.R')
#sourceCpp('../functions_libraries/Rcpp_functions.cpp')
Pviv_rGenome_object = rGenome()
for(obj in 1:length(rGenome_objects)){
Pviv_rGenome_object@data$gt = rbind(Pviv_rGenome_object@data$gt, rGenome_objects[[obj]]@data$gt)
Pviv_rGenome_object@data$loci_table = rbind(Pviv_rGenome_object@data$loci_table, rGenome_objects[[obj]]@data$loci_table)
}
Pviv_rGenome_object@data$metadata = rbind(Pviv_rGenome_object@data$metadata, rGenome_objects[[obj]]@data$metadata)
rm(list = c('rGenome_objects', 'obj'))
# There are some spelling mistakes in the sample codes of the sequencing files, The table VENPER2019_2022.csv match the incorrect codes with the correct codes
external_metadata = read.csv('../GeneticDiversity/merged_metadata.csv')
PerVen_seq_codes = read.csv('PerVen/VENPER2019_2022.csv')
# Remove Duplicated records
external_metadata = external_metadata[!duplicated(external_metadata$Sample_id),]
PerVen_seq_codes = PerVen_seq_codes[!duplicated(PerVen_seq_codes$PreferedSampleID),]
# merge incorrect and correcte codes with metadata
external_metadata = merge(external_metadata, PerVen_seq_codes, by.x = 'Sample_id', by.y = 'AlternateSampleID', all.x = T)
external_metadata %<>% mutate(PreferedSampleID = case_when(
is.na(PreferedSampleID) ~ Sample_id,
!is.na(PreferedSampleID) ~ PreferedSampleID
))
# Add metadata to the rGenome object
Pviv_rGenome_object@data$metadata = merge(Pviv_rGenome_object@data$metadata, external_metadata[,c("PreferedSampleID", "Sample_id", "Study", "Country", 'Date_of_Collection', "Year_of_Collection",'Subnational_level0', 'Subnational_level1', "Subnational_level2")], by.x = 'Sample_id', by.y = 'PreferedSampleID', all.x = T)
# Sorting samples
rownames(Pviv_rGenome_object@data$metadata) =
Pviv_rGenome_object@data$metadata$Sample_id
Pviv_rGenome_object@data$metadata =
Pviv_rGenome_object@data$metadata[colnames(Pviv_rGenome_object@data$gt),]
3.1 Sample amplification rate
Pviv_rGenome_object = SampleAmplRate(Pviv_rGenome_object)
3.2 Check the distribution of the amplification rate of the samples using a histogram
Pviv_rGenome_object@data$metadata %>% ggplot(aes(x = SampleAmplRate, fill = Country))+
geom_histogram(position = 'stack', binwidth = .05, color = 'gray40')+
theme_bw()+
labs(title = 'Sample Amplification Rate distribution',
x = 'Amplification Rate')
3.3 Remove samples with less than 75% of amplified loci (> 25% of missing data)
Pviv_rGenome_object = filter_samples(obj = Pviv_rGenome_object, v =
Pviv_rGenome_object@data$metadata$SampleAmplRate >= .75)
4.1 Update allele counts
Pviv_rGenome_object@data$loci_table = cbind(Pviv_rGenome_object@data$loci_table,
get_AC(obj = Pviv_rGenome_object))
4.2 Filter monomorphic sites
Pviv_rGenome_object = filter_loci(Pviv_rGenome_object,
v = Pviv_rGenome_object@data$loci_table$Cardinality > 1)
5.1 Calculate the amplification rate of each loci
Pviv_rGenome_object = LocusAmplRate(Pviv_rGenome_object)
5.2 Check the distribution of the amplification rate of the samples using a histogram
Pviv_rGenome_object@data$loci_table %>%
mutate(CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = LocusAmplRate)) +
geom_point(alpha = 0.7, size = .25) +
facet_grid(.~ CHROM2)+
theme_bw()+
labs(y = 'Ampl. Rate', x = 'Chromosomal position')+
theme(legend.position = 'right',
axis.text.x = element_blank())
5.3 Filter loci with amplification rate below .75
Pviv_rGenome_object = filter_loci(Pviv_rGenome_object, v = Pviv_rGenome_object@data$loci_table$LocusAmplRate >= .75)
Pviv_rGenome_object@data$loci_table$ALT =
ifelse(Pviv_rGenome_object@data$loci_table$REF !=
gsub(',([ATCG]|\\*)+',
'',
gsub(':\\d+',
'',
Pviv_rGenome_object@data$loci_table$Alleles)),
gsub(':\\d+',
'',
Pviv_rGenome_object@data$loci_table$Alleles),
gsub('^([ATCG]|\\*)+,',
'',
gsub(':\\d+', '', Pviv_rGenome_object@data$loci_table$Alleles))
)
Pviv_rGenome_object@data$loci_table$TypeOf_Markers =
TypeOf_Marker(Pviv_rGenome_object, w = 1, n = 1)
Pviv_rGenome_object@data$loci_table$ObsHet = get_ObsHet(Pviv_rGenome_object, by = 'loci', w = 1, n = 1)
Pviv_rGenome_object@data$loci_table$frac_ofHet_pAlts =
frac_ofHet_pAlt(Pviv_rGenome_object, w = 1, n = 1)
10.1 Distribution of the fraction of alternative alleles in polyclonal samples
Pviv_rGenome_object@data$loci_table %>% ggplot(aes(x = frac_ofHet_pAlts))+
geom_histogram(binwidth = 0.01)+
labs(x = 'Fraction of heterozygous samples per alternative allele per site',
y = 'Number of sites (Loci)')+
theme_bw()
10.2 Classify sites based on the proportion of alternative alleles in
polyclonal samples per site
Pviv_rGenome_object@data$loci_table %<>% mutate(ALT_FILTER =case_when(
frac_ofHet_pAlts == 1 ~ '100%',
frac_ofHet_pAlts < 1 & frac_ofHet_pAlts > .5 ~ '50 - 99%',
frac_ofHet_pAlts <= .5 ~ '<=50%',
))
10.3 Distribution of observed heterozygosity per locus, by type of marker, by group (defined in previuos step 10.2)
Pviv_rGenome_object@data$loci_table %>%
ggplot(aes(x = ObsHet,
fill = factor(ALT_FILTER,
levels = c('<=50%', '50 - 99%', '100%'))))+
geom_histogram(binwidth = .01)+
scale_fill_manual(values = c('dodgerblue3', 'gold3', 'firebrick3'))+
facet_wrap(.~factor(TypeOf_Markers, levels = c(
'SNP',
'INDEL',
'INDEL:Homopolymer',
'INDEL:Dinucleotide_STR',
'INDEL:Trinucleotide_STR',
'INDEL:Tetranucleotide_STR',
'INDEL:Pentanucleotide_STR',
'INDEL:Hexanucleotide_STR'
)), scales = 'free_y', ncol = 4)+
labs(y = 'Number of Loci',
x = 'Observed Heterozygosity per locus',
fill = 'Het/Alt')+
theme_bw()
Pviv_rGenome_object = filter_loci(Pviv_rGenome_object,
v = !(Pviv_rGenome_object@data$loci_table$TypeOf_Markers %in%
c('INDEL:Homopolymer', 'INDEL:Dinucleotide_STR')))
# In that way we identify that maximum fraction of Heterozygous samples per loci is 0.4855491
Pviv_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDELs'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = factor(ALT_FILTER, levels = c('<=50%', '50 - 99%', '100%')))) +
geom_point(alpha = 0.5, size = .25) +
scale_color_manual(values = c('dodgerblue3', 'gold3', 'firebrick3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'Het/Alt')+
theme(legend.position = 'right',
axis.text.x = element_blank(),
axis.title.x = element_blank())
Pviv_rGenome_object@data$loci_table = mean_ObsHet(Pviv_rGenome_object, gff = '../reference/genes.gff')
ObsHet_Threshold = quantile(Pviv_rGenome_object@data$loci_table %>%
filter(ALT_FILTER == '<=50%', TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(mean_ObsHet = max(mean_ObsHet)) %>%
select(mean_ObsHet) %>%
unlist, .95)
Pviv_rGenome_object@data$loci_table %>%
filter(ALT_FILTER == '<=50%', TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(mean_ObsHet = max(mean_ObsHet)) %>%
ggplot(aes(x = mean_ObsHet))+
geom_histogram(binwidth = .005)+
geom_vline(xintercept = ObsHet_Threshold)+
labs(y = 'Number of genomic regions', x = 'Average Observed Heteozygosity in SNPs') +
theme_bw()
Pviv_rGenome_object@data$loci_table %<>% mutate(ObsHet_Filter = case_when(
mean_ObsHet > ObsHet_Threshold ~ 'OUT',
mean_ObsHet <= ObsHet_Threshold ~ 'PASS'
))
Pviv_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDEL'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = ObsHet_Filter)) +
geom_point(alpha = 0.5, size = .25) +
geom_hline(yintercept = ObsHet_Threshold)+
scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity',
x = 'Chromosomal position',
color = 'Mean ObsHet > Th')+
theme(legend.position = 'right',
axis.text.x = element_blank())
Pviv_rGenome_object@data$loci_table = SNP_density(Pviv_rGenome_object, gff = '../reference/genes.gff')
SNP_density_threshold = quantile(Pviv_rGenome_object@data$loci_table %>%
filter(TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(SNP_density = max(SNP_density)) %>%
select(SNP_density) %>%
unlist, .95)
Pviv_rGenome_object@data$loci_table %>%
filter(TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(SNP_density = max(SNP_density)) %>%
ggplot(aes(x = SNP_density))+
geom_histogram(binwidth = .001)+
geom_vline(xintercept = SNP_density_threshold)+
labs(y = 'Number of genomic regions', x = 'SNP density') +
theme_bw()
Pviv_rGenome_object@data$loci_table %<>%
mutate(SNP_density_Filter = case_when(
SNP_density > SNP_density_threshold ~ 'OUT',
SNP_density <= SNP_density_threshold ~ 'PASS'
))
Pviv_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDEL'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = SNP_density_Filter)) +
geom_point(alpha = 0.5, size = .25) +
scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'SNP density')+
theme(legend.position = 'right',
axis.text.x = element_blank())
Pviv_rGenome_object@data$loci_table %<>%
mutate(Alignment_Filter = case_when(
ObsHet_Filter == 'OUT' & SNP_density_Filter == 'OUT' ~ 'OUT',
!(ObsHet_Filter == 'OUT' & SNP_density_Filter == 'OUT') ~ 'PASS'
))
Pviv_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDEL'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = Alignment_Filter)) +
geom_point(alpha = 0.5, size = .25) +
scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'Alignment\nFilter')+
theme(legend.position = 'right',
axis.text.x = element_blank())
Pviv_rGenome_object@data$loci_table %>%
filter(Alignment_Filter == 'OUT') %>%
select(gene_id) %>%
unique()
## gene_id
## PvP01_01_v1_2358 PVP01_0100100
## PvP01_01_v1_5102 PVP01_0100200
## PvP01_01_v1_22901 PVP01_0100600
## PvP01_01_v1_32706 PVP01_0100800
## PvP01_01_v1_40406 PVP01_0101000
## PvP01_01_v1_47922 PVP01_0101100
## PvP01_01_v1_939461 PVP01_0120500
## PvP01_01_v1_944693 PVP01_0120600
## PvP01_01_v1_966877 PVP01_0121100
## PvP01_01_v1_971216 PVP01_0121200
## PvP01_01_v1_995573 PVP01_0121700
## PvP01_01_v1_1012267 PVP01_0122000
## PvP01_01_v1_1014104 PVP01_0122100
## PvP01_02_v1_424 PVP01_0200100
## PvP01_02_v1_2877 PVP01_0200200
## PvP01_02_v1_8478 PVP01_0200300
## PvP01_02_v1_21610 PVP01_0200500
## PvP01_02_v1_30614 PVP01_0200700
## PvP01_02_v1_94406 PVP01_0201900
## PvP01_02_v1_842499 PVP01_0219300
## PvP01_02_v1_857911 PVP01_0219700
## PvP01_02_v1_876987 PVP01_0220100
## PvP01_02_v1_880756 PVP01_0220200
## PvP01_02_v1_882193 PVP01_0220300
## PvP01_04_v1_6222 PVP01_0400200
## PvP01_04_v1_32197 PVP01_0400800
## PvP01_04_v1_37876 PVP01_0401000
## PvP01_04_v1_60396 PVP01_0401600
## PvP01_04_v1_118730 PVP01_0402800
## PvP01_04_v1_120730 PVP01_0402900
## PvP01_04_v1_127811 PVP01_0403100
## PvP01_04_v1_212026 PVP01_0404700
## PvP01_04_v1_1011571 PVP01_0424500
## PvP01_05_v1_123 PVP01_0500100
## PvP01_05_v1_11855 PVP01_0500400
## PvP01_05_v1_15821 PVP01_0500500
## PvP01_05_v1_32842 PVP01_0501000
## PvP01_05_v1_36220 PVP01_0501100
## PvP01_05_v1_1010779 PVP01_0524900
## PvP01_05_v1_1489237 PVP01_0534600
## PvP01_05_v1_1523875 PVP01_0535400
## PvP01_06_v1_1018344 PVP01_0624500
## PvP01_06_v1_1025964 PVP01_0624700
## PvP01_06_v1_1036654 PVP01_0625000
## PvP01_07_v1_14689 PVP01_0700200
## PvP01_07_v1_1490520 PVP01_0735600
## PvP01_07_v1_1517558 PVP01_0736200
## PvP01_07_v1_1527258 PVP01_0736400
## PvP01_07_v1_1548528 PVP01_0736900
## PvP01_07_v1_1569142 PVP01_0737500
## PvP01_07_v1_1605981 PVP01_0738300
## PvP01_07_v1_1618091 PVP01_0738600
## PvP01_07_v1_1651382 PVP01_0739300
## PvP01_08_v1_1698632 PVP01_0839700
## PvP01_08_v1_1707627 PVP01_0839900
## PvP01_08_v1_1712809 PVP01_0840000
## PvP01_08_v1_1737931 PVP01_0840600
## PvP01_09_v1_2187067 PVP01_0949400
## PvP01_09_v1_2201079 PVP01_0949600
## PvP01_09_v1_2233004 PVP01_0950200
## PvP01_10_v1_21881 PVP01_1000500
## PvP01_10_v1_36099 PVP01_1000700
## PvP01_10_v1_51220 PVP01_1001000
## PvP01_10_v1_59757 PVP01_1001200
## PvP01_10_v1_63476 PVP01_1001300
## PvP01_10_v1_67254 PVP01_1001400
## PvP01_10_v1_75970 PVP01_1001600
## PvP01_10_v1_79626 PVP01_1001700
## PvP01_10_v1_1340315 PVP01_1031200
## PvP01_10_v1_1345030 PVP01_1031300
## PvP01_10_v1_1349154 PVP01_1031400
## PvP01_10_v1_1521211 PVP01_1035400
## PvP01_11_v1_6687 PVP01_1100200
## PvP01_11_v1_14525 PVP01_1100400
## PvP01_11_v1_18005 PVP01_1100500
## PvP01_11_v1_2052777 PVP01_1147900
## PvP01_11_v1_2109141 PVP01_1148900
## PvP01_11_v1_2113665 PVP01_1149000
## PvP01_11_v1_2125519 PVP01_1149400
## PvP01_12_v1_2460 PVP01_1200200
## PvP01_12_v1_34668 PVP01_1200900
## PvP01_12_v1_38656 PVP01_1201000
## PvP01_12_v1_42765 PVP01_1201100
## PvP01_12_v1_808401 PVP01_1220000
## PvP01_12_v1_814582 PVP01_1220300
## PvP01_12_v1_3080582 PVP01_1272600
## PvP01_12_v1_3088130 PVP01_1272700
## PvP01_12_v1_3144435 PVP01_1274100
## PvP01_13_v1_4963 PVP01_1300100
## PvP01_13_v1_20486 PVP01_1300500
## PvP01_13_v1_2073285 PVP01_1347500
## PvP01_13_v1_2076982 PVP01_1347600
## PvP01_13_v1_2080441 PVP01_1347700
## PvP01_13_v1_2084606 PVP01_1347800
## PvP01_14_v1_169 PVP01_1400100
## PvP01_14_v1_26793 PVP01_1400700
## PvP01_14_v1_3009451 PVP01_1470400
## PvP01_14_v1_3099389 PVP01_1472000
## PvP01_14_v1_3103166 PVP01_1472100
## PvP01_14_v1_3107169 PVP01_1472200
## PvP01_14_v1_3122958 PVP01_1472500
## PvP01_14_v1_3127901 PVP01_1472600
Pviv_rGenome_object =
filter_loci(Pviv_rGenome_object,
v = Pviv_rGenome_object@data$loci_table$Alignment_Filter == 'PASS')
Pviv_rGenome_object@data$metadata$fracHet =
get_ObsHet(obj = Pviv_rGenome_object, by = 'sample')
Pviv_rGenome_object@data$metadata$Fws =
get_Fws(obj = Pviv_rGenome_object, w = 1, n = 1)
Pviv_rGenome_object@data$metadata %>%
ggplot(aes(x = fracHet, y = Fws))+
geom_point()+
geom_hline(yintercept = .97)+
geom_hline(yintercept = .88)+
geom_vline(xintercept = .015)+
geom_vline(xintercept = .06)+
facet_grid(.~Country)+
labs(x = 'Fraction of heterozygous loci')+
theme_bw()
Pviv_rGenome_object@data$metadata %<>%
mutate(Clonality = case_when(
Fws >= .97 ~ 'Monoclonal',
Fws < .97 & Fws >= .88 ~ 'Highly related Polyclonal',
Fws < .88 ~ 'Polyclonal'
))
rm(list = ls()[-grep('Pviv_rGenome_object',ls())])
save.image('../GeneticDiversity/PerVen_Pviv_filtered_rGenome.RData')